In the analysis of time series from nonlinear sources, mutual information (MI) is used as a
nonlinear statistical criterion for the selection of an appropriate time delay in time delay re- 
construction of the state space. MI is a statistic over the sets of sequences associated with the 
dynamical source, and we examine here the distribution of MI, thus going beyond the familiar 
analysis of its average alone. We give for the first time the distribution of MI for a standard, 
classical communications channel with Gaussian, additive white noise. For time series analysis 
of a dynamical system, we show how to determine the distribution of MI and discuss the impli- 
cations for the use of average mutual information (AMI) in selecting time delays in phase space 
reconstruction. 

PACS numbers: 84.40.Ua, 89.70.+c, 05.45.-a, 05.10.L, 05.45.Tp

Information theory characterizes general dynamical systems through a
nonlinear connection between sequences of symbols associated with the
action of the system. The connection could be between inputs and outputs
of a channel or could be associated with predictions of future
measurements from past observations. Shannon's identification of MI as
the essential statistic in such systems gives a framework for the
discussion of applications as diverse as fiber optic communications
systems with time scales shorter than nanoseconds and nervous systems
with time scales of a few tens of milliseconds to seconds.

In the analysis of time series coming from nonlinear sources one
reconstructs a proxy state space using observations of a single variable
V(t). A useful reconstructed state space employs the observed variable
V(t) and its time delays to form a data vector

$$y(t) = [V(t),\, V(t - T\tau_s),\, \ldots,\, V(t - (d-1)T\tau_s)], \qquad (1)$$

where $\tau_s$ is the sampling time and T is an integer. We must choose
T so that the components of y(t) are 'independent enough' of each other
to be good coordinates for the space. For this purpose

$$I(T) = \sum_{\{V(t),\,V(t+T\tau_s)\}} P(V(t), V(t+T\tau_s)) \log_2\left(\frac{P(V(t), V(t+T\tau_s))}{P(V(t))\,P(V(t+T\tau_s))}\right), \qquad (2)$$

the AMI, has been used. Following the suggestion of Fraser and Swinney,
T is chosen where I(T) has its first minimum.
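
The first-minimum prescription can be illustrated with a minimal sketch. It assumes a plug-in histogram estimate of Eq. (2) with equal-width bins; the function names `average_mutual_information` and `first_minimum`, the bin count, and the noisy sine test signal are illustrative choices, not taken from the paper.

```python
import math
import random

def average_mutual_information(v, T, n_bins=16):
    """Plug-in histogram estimate of I(T), in bits, between v[t] and v[t+T]."""
    lo, hi = min(v), max(v)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant series

    def bin_of(x):
        return min(int((x - lo) / width), n_bins - 1)

    n = len(v) - T
    joint = {}
    p_now = [0.0] * n_bins
    p_later = [0.0] * n_bins
    for t in range(n):
        i, j = bin_of(v[t]), bin_of(v[t + T])
        joint[(i, j)] = joint.get((i, j), 0.0) + 1.0 / n
        p_now[i] += 1.0 / n
        p_later[j] += 1.0 / n
    return sum(p * math.log2(p / (p_now[i] * p_later[j]))
               for (i, j), p in joint.items())

def first_minimum(values):
    """Index of the first local minimum of a list, or None if there is none."""
    for k in range(1, len(values) - 1):
        if values[k] < values[k - 1] and values[k] <= values[k + 1]:
            return k
    return None

# Hypothetical test signal: a noisy sine with a period of 40 samples.
random.seed(0)
signal = [math.sin(2 * math.pi * t / 40) + 0.05 * random.gauss(0, 1)
          for t in range(4000)]
ami_curve = [average_mutual_information(signal, T) for T in range(0, 21)]
T_star = first_minimum(ami_curve)  # candidate delay; None if no minimum found
```

With real data the location of the first minimum depends on the bin count and the record length, so `T_star` is best treated as a starting point rather than a fixed answer.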

The AMI I(T) answers the question: how much, in bits, do we know about
the measurement at $t + T\tau_s$ from the measurement at t, averaged over
the whole time series or attractor? However, it does not tell us how MI
is distributed over the time series. We have only the hope that the
distribution of MI is sharp enough to provide a choice for T which is
useful over the whole time series.

If the dynamics of the process producing V(t) has two time scales, for
example, one might expect the distribution of MI to reflect that by
exhibiting two distinct peaks, each associated with one process
efficiently connecting to itself through the dynamics of the system. In
such a case the AMI may tell us little about how to select a value of T
to use in making y(t). Since the MI whose mean is evaluated in Eq. (2)
tells us how well we can predict V(t) knowing V(t - T), the presence of
two time scales leading to a multimodal distribution of MI is quite
natural.

In such circumstances, it may be better not to use the single T chosen
by the AMI, but instead to select different T in different parts of the
attractor, or to use another prescription altogether for choosing T. As
many physically or biologically interesting dynamical systems have two
or more relevant time scales, the distribution of MI presents an
interesting issue.

Time delay phase space reconstruction is widely used 
as an initial step in the analysis of time series from non- 
linear sources, so it is useful to establish how MI, consid- 
ered as a 'statistic' distributed over the measurements, is 
distributed. This will allow one to proceed well beyond 
the use of the average alone. 
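
As a concrete illustration of the reconstruction step, the delay vectors of Eq. (1) can be formed as in the sketch below. The function name `delay_vectors` is hypothetical, and the series is assumed to be a plain list sampled at the interval $\tau_s$, so that the delay T is an index offset.

```python
def delay_vectors(v, T, d):
    """Return the list of y(t) = [v[t], v[t - T], ..., v[t - (d - 1) * T]].

    v is a sequence of samples, T the integer delay (in samples) and d the
    embedding dimension. Only times t with a full history are returned.
    """
    start = (d - 1) * T
    return [[v[t - j * T] for j in range(d)] for t in range(start, len(v))]

# Example with a short ramp so the indexing is easy to follow.
vectors = delay_vectors(list(range(10)), T=2, d=3)
```

Each returned vector lists the current sample first and progressively older samples after it, matching the ordering of Eq. (1).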

Our considerations here may provide a framework for 
efficient use of a broad set of communications channels. 
Presented with a channel, which need not be stationary, 
one may wish to know those sequences which maximize 
MI to allow the best use of that channel. Those sequences 
need not always correspond to the AMI associated with 
the channel. In other words the distribution of MI can 
depend both on the channel and on the symbol sequences 
conveyed by the channel. 

MI provides a nonlinear relation between one sequence of symbols and a
second sequence of symbols. These could be the input and output,
respectively, of a communications system or of a network of neurons
performing a functional task. We consider a sequence
S = {..., s(-1), s(0), s(1), ...} observed at discrete times and ask
about its nonlinear connection to a second sequence
R = {..., r(-1), r(0), r(1), ...}. The symbols {r(l)} and {s(k)} may
take discrete or continuous values.

We are interested in determining the properties of S from measurements
of the sequence R. The ability to do this is characterized by the MI

$$I(s(k), r(l)) = \log\left[\frac{P_{SR}(s(k), r(l))}{P_S(s(k))\,P_R(r(l))}\right]. \qquad (3)$$



$P_{SR}(s,r)$ is the joint distribution function of symbols s taken from
the input sequence S and symbols r taken from the output sequence R.
$P_{SR}(s,r)$ is the essential ingredient in determining MI. It depends
both on the sequences S and R and, importantly, on the channel or
dynamical process connecting the s(k) to the r(l).
$P_S(s) = \sum_r P_{SR}(s,r)$ and $P_R(r) = \sum_s P_{SR}(s,r)$ are the
distributions of S symbols and R symbols, respectively. When the
sequences are independent, $P_{SR}(s,r) = P_S(s)P_R(r)$, and
I(s,r) = 0. The MI I(s,r) is a variable over (s,r), and it has its own
distribution function $P_I(x)$. $P_I(x)$ tells us the frequency with
which the value x = I(s,r) occurs.

The distribution of MI is defined by

$$P_I(x) = \int ds\, dr\, P_{SR}(s,r)\, \delta(x - I(s,r)), \qquad (4)$$

with $I(s,r) = \log\left[\frac{P_{SR}(s,r)}{P_S(s)P_R(r)}\right]$. The AMI
is $\int dx\, x\, P_I(x) = \int ds\, dr\, P_{SR}(s,r)\, I(s,r)$, as it
should be. While $P_I(x) \geq 0$, negative values of I(r,s) are
possible. Negative values of x = I(r,s) reflect the circumstance that
(r,s) pairs may occur less frequently than they would if the individual
symbols were independent, $P_{SR}(s,r) < P_S(s)P_R(r)$. If a process
produces out-of-phase appearances of r and s, for example, negative
values of MI will occur.
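
For discrete symbols the definition of the distribution of MI reduces to collecting the joint probability at each distinct value of I(s,r). The sketch below works through a small hand-made joint distribution; the names `pointwise_mi` and `mi_distribution` and the numbers in `joint` are illustrative assumptions, and natural logarithms are used.

```python
import math

def pointwise_mi(joint):
    """joint maps (s, r) to a probability. Returns {(s, r): I(s, r)} in nats."""
    p_s, p_r = {}, {}
    for (s, r), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p
        p_r[r] = p_r.get(r, 0.0) + p
    return {(s, r): math.log(p / (p_s[s] * p_r[r]))
            for (s, r), p in joint.items() if p > 0}

def mi_distribution(joint):
    """Sorted list of (x, mass): the probability carried by each MI value x."""
    mass = {}
    for key, x in pointwise_mi(joint).items():
        x = round(x, 12)  # merge values that differ only by rounding error
        mass[x] = mass.get(x, 0.0) + joint[key]
    return sorted(mass.items())

# Two binary symbols that usually agree: agreeing pairs carry positive MI,
# disagreeing pairs carry negative MI.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
dist = mi_distribution(joint)
ami = sum(x * w for x, w in dist)  # mean of the distribution, i.e. the AMI
```

Here the rare disagreeing pairs land at a negative value of x while the mean stays positive, which is the situation described in the text.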

To evaluate $P_I(x)$ using the observed $P_{SR}(s,r)$ we could solve the
delta function condition x = I(s,r) for, say, $s^*(x,r)$ and then
perform the integral over r. An alternative method is to Fourier
transform $P_I(x)$, giving
$Q_I(f) = \int dx\, e^{-2\pi i f x} P_I(x) = \int ds\, dr\, e^{-2\pi i f I(s,r)} P_{SR}(s,r)$.
$P_I(x)$ is recovered by the inverse Fourier transform
$P_I(x) = \int df\, e^{2\pi i f x} Q_I(f)$. In effect we are using the
Fourier variable f as a Lagrange multiplier to implement the required
delta function on x = I(s,r). We avoid solving for $s^*(x,r)$ and
require only an integral over the same ingredients used in evaluating
the AMI.
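
A small numerical check of the Fourier route, assuming discrete symbols so that the integrals become sums: Q_I(0) should equal 1, and the slope of Q_I at f = 0 should return the AMI. The helper names and the example joint distribution are hypothetical.

```python
import cmath
import math

def pointwise_mi(joint):
    """joint maps (s, r) to a probability. Returns {(s, r): I(s, r)} in nats."""
    p_s, p_r = {}, {}
    for (s, r), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p
        p_r[r] = p_r.get(r, 0.0) + p
    return {(s, r): math.log(p / (p_s[s] * p_r[r]))
            for (s, r), p in joint.items() if p > 0}

def q_of_f(joint, f):
    """Q_I(f) = sum over (s, r) of P_SR(s, r) * exp(-2 pi i f I(s, r))."""
    mi = pointwise_mi(joint)
    return sum(joint[k] * cmath.exp(-2j * math.pi * f * x)
               for k, x in mi.items())

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# dQ/df at f = 0 equals -2 pi i <x>, so the AMI follows from a centered
# finite difference of Q_I around f = 0.
h = 1e-4
slope = (q_of_f(joint, h) - q_of_f(joint, -h)) / (2 * h)
ami_from_q = -slope.imag / (2 * math.pi)
ami_direct = sum(joint[k] * x for k, x in pointwise_mi(joint).items())
```

The two estimates of the AMI agree to the accuracy of the finite difference, which is a convenient sanity check before inverting Q_I(f) for the full distribution.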

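A classical example is provided by the Gaussian distribution in (r, s)

$$P_{SR}(s,r) = \frac{\sqrt{ab - \sigma^2}}{\pi}\, e^{-(a r^2 + b s^2 + 2\sigma s r)}. \qquad (5)$$

The quantity $\zeta = \sigma^2/(ab) < 1$ for this to be normalizable.
$P_R(r) = \sqrt{a(1-\zeta)/\pi}\; e^{-a r^2 (1-\zeta)}$,
$P_S(s) = \sqrt{b(1-\zeta)/\pi}\; e^{-b s^2 (1-\zeta)}$,
and $I(s,r) = X_0 - (\zeta a r^2 + \zeta b s^2 + 2\sigma r s)$, where
$X_0 = -\frac{1}{2}\log(1 - \zeta)$. When $\sigma = 0$, I(r,s) = 0.

From this we find

$$Q_I(f) = \frac{e^{-2\pi i f X_0}}{\sqrt{1 + 4\pi^2 \zeta f^2}}. \qquad (6)$$

As expected, when $\zeta = 0$, $Q_I(f) = 1$ and $P_I(x) = \delta(x)$.
$P_I(x)$ can also be evaluated for $\zeta \neq 0$:

$$P_I(x) = \frac{1}{\pi\sqrt{\zeta}}\, K_0\!\left(\frac{|x - X_0|}{\sqrt{\zeta}}\right), \qquad (7)$$

where $K_0(z)$ is a zeroth order modified Bessel function. $P_I(x)$ is
thus symmetric about $x = X_0$ and has a logarithmic singularity there.
The AMI is $X_0$, $\langle (x - X_0)^2 \rangle = \zeta$, and
$\langle (x - X_0)^{2m} \rangle = \zeta^m ((2m - 1)!!)^2$.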
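
The Gaussian moments can be checked by simulation. The sketch below assumes the joint density is proportional to exp(-(a r^2 + b s^2 + 2 sigma s r)), with zeta = sigma^2/(ab) and X_0 = -(1/2) log(1 - zeta) in natural logarithms. It draws r from its marginal, then s from its conditional distribution given r, and evaluates I(s,r) for each pair. The function name, parameter values, and sample size are illustrative.

```python
import math
import random

def sample_gaussian_mi(a, b, sigma, n_samples, seed=0):
    """Monte Carlo samples of I(s, r) for the bivariate Gaussian example."""
    rng = random.Random(seed)
    zeta = sigma ** 2 / (a * b)
    if not zeta < 1.0:
        raise ValueError("need sigma^2 / (a b) < 1 for a normalizable density")
    x0 = -0.5 * math.log(1.0 - zeta)
    sd_r = math.sqrt(1.0 / (2.0 * a * (1.0 - zeta)))  # marginal spread of r
    sd_s = math.sqrt(1.0 / (2.0 * b))                 # spread of s given r
    values = []
    for _ in range(n_samples):
        r = rng.gauss(0.0, sd_r)
        s = rng.gauss(-sigma * r / b, sd_s)           # conditional mean of s
        values.append(x0 - (zeta * a * r * r + zeta * b * s * s
                            + 2.0 * sigma * r * s))
    return zeta, x0, values

zeta, x0, mi = sample_gaussian_mi(a=1.0, b=2.0, sigma=1.0, n_samples=100000)
mean = sum(mi) / len(mi)                          # should be close to x0
var = sum((x - x0) ** 2 for x in mi) / len(mi)    # should be close to zeta
```

For these parameters zeta = 1/2, so the spread of MI about its mean is comparable to the mean itself and a sizable fraction of the samples is negative.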

A standard example of a Gaussian channel is given by the case of a
Gaussian input signal with $P_S(s) = e^{-s^2/S}/\sqrt{\pi S}$, and with
conditional probability of response
$P_{SR}(r|s) = e^{-(r-s)^2/N}/\sqrt{\pi N}$. This represents a
statistical signal transported through a passive channel with additive,
Gaussian, white noise. The signal to noise ratio in this channel is S/N.

The distribution of mutual information is that just given with
$a = \frac{1}{N}$, $b = \frac{1}{S}\left(1 + \frac{S}{N}\right)$,
$\sigma = -\frac{1}{N}$, leading to the AMI
$X_0 = \frac{1}{2}\log\left(1 + \frac{S}{N}\right)$, which is familiar,
and moments, which are a new result, given as above with
$\zeta = \frac{S/N}{1 + S/N}$. This means that for a large S/N, the AMI
grows logarithmically in S/N, but, perhaps surprisingly, the moments
about this average are of order unity or larger. That is, the
distribution is not sharp, but becomes broad.

For small S/N, $X_0 \approx \frac{1}{2}\frac{S}{N}$ and the moments are
powers of S/N. In fact, for small S/N, or noisy channels, the mean
square deviation of MI about its mean, $\zeta \approx S/N$, is twice the
mean value.
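
The two limits can be made concrete with a few numbers. This sketch assumes natural logarithms, the familiar AMI X_0 = (1/2) log(1 + S/N), and the identification zeta = (S/N)/(1 + S/N) for the second moment about the mean, which follows from zeta = sigma^2/(ab) with a = 1/N, b = (1/S)(1 + S/N), sigma = -1/N. The function name is hypothetical.

```python
import math

def gaussian_channel_mi_stats(snr):
    """Return (mean, variance) of MI for the additive Gaussian channel.

    snr is the signal to noise ratio S/N. The mean is in nats.
    """
    mean = 0.5 * math.log(1.0 + snr)
    variance = snr / (1.0 + snr)  # zeta, the second moment about the mean
    return mean, variance

# Noisy channel: both moments are small, with variance about twice the mean.
mean_small, var_small = gaussian_channel_mi_stats(1e-4)

# Clean channel: the mean grows like log(S/N) while the variance saturates
# near one.
mean_large, var_large = gaussian_channel_mi_stats(1e4)
```

In the noisy limit the ratio `var_small / mean_small` approaches 2, while in the clean limit the standard deviation stays of order one even as the mean keeps growing.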

Other $P_{SR}(s,r)$ must be dealt with numerically, and the sequence
values must be discrete. From $P_{SR}(s(k), r(l))$ a MI matrix is
generated:

$$I(s(k), r(l)) = \log_2\left[\frac{P_{SR}(s(k), r(l))}{P_S(s(k))\,P_R(r(l))}\right], \qquad (8)$$

and Q(f) is approximated by

$$Q(f) \approx \sum_{k,l} P_{SR}(s(k), r(l))\, e^{-2\pi i f I(s(k), r(l))}\, \Delta r\, \Delta s = \sum_k P_I(x_k)\, e^{-2\pi i f x_k}\, \Delta x_k, \qquad (9)$$

where $\Delta s$ and $\Delta r$ are the sizes of the symbol bins.
$P_I(x)$ is evaluated in bins of size $\Delta I$. We denote
$P_m = P_I(x_m) = P_I(m\Delta I + I_{min})$, where m is an integer that
ranges from 0 to $N_{bin}$ and $I_{min}$ is the minimum value of MI.
Q(f) is determined for $N_{bin}$ values of
$f_n = \frac{n}{N_{bin}\Delta I}$, with n an integer in the range
$[-\frac{1}{2}N_{bin}, \frac{1}{2}N_{bin}]$. The approximation to
$Q(f_n)$ is then
$Q(f_n) = \Delta I\, e^{-2\pi i I_{min} f_n} \sum_k P_k\, e^{-2\pi i k \Delta I f_n}$,
so the inverse Fourier transform of
$e^{2\pi i I_{min} f_n}\, Q(f_n)/\Delta I$ is $P_m$.
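
The discrete transform pair can be sketched directly. The code below assumes N_bin bins indexed m = 0, ..., N_bin - 1 at x_m = m ΔI + I_min and N_bin frequencies f_n = n/(N_bin ΔI) with n running over N_bin consecutive integers centered on zero. It uses a plain O(N^2) sum rather than an FFT, and the function names and bin values are hypothetical.

```python
import cmath
import math

def q_from_bins(p, delta_i, i_min):
    """Q(f_n) from binned densities p[m] = P_I(m * delta_i + i_min).

    Returns a dict {n: Q(f_n)} with f_n = n / (len(p) * delta_i).
    """
    n_bin = len(p)
    q = {}
    for n in range(-(n_bin // 2), n_bin - n_bin // 2):
        f_n = n / (n_bin * delta_i)
        dft = sum(p[k] * cmath.exp(-2j * math.pi * k * n / n_bin)
                  for k in range(n_bin))
        q[n] = delta_i * cmath.exp(-2j * math.pi * i_min * f_n) * dft
    return q

def bins_from_q(q, delta_i, i_min):
    """Recover p[m] by removing the phase and prefactor, then inverting."""
    n_bin = len(q)
    p = []
    for m in range(n_bin):
        total = 0j
        for n, q_n in q.items():
            f_n = n / (n_bin * delta_i)
            stripped = cmath.exp(2j * math.pi * i_min * f_n) * q_n / delta_i
            total += stripped * cmath.exp(2j * math.pi * m * n / n_bin)
        p.append((total / n_bin).real)
    return p

# Hypothetical binned density: the values times the bin width sum to one.
p_bins = [0.1, 0.4, 1.0, 0.5]
q = q_from_bins(p_bins, delta_i=0.5, i_min=-0.3)
p_back = bins_from_q(q, delta_i=0.5, i_min=-0.3)
```

For larger N_bin the inner sums would normally be replaced by an FFT; the explicit form is kept here so that each factor can be matched to the expression for Q(f_n).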

To apply this to the MI used in selecting the time delay in phase space
reconstruction we identify $s(t) = V(t_0 + k\tau_s) = s(k)$ and
$r(t) = V(t_0 + (l + T)\tau_s) = r(l)$, where $t_0$ is some initial time
in the data. We have investigated $P_I(x)$ for various T for two simple
dynamical systems: (1) the Lorenz system, and (2) a nonlinear hysteretic
circuit using data provided by L. Pecora and T. Carroll (Personal
Communication). Both data sets are discussed elsewhere.
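
The whole procedure for a scalar time series can be sketched end to end. The example below is an illustration under stated assumptions, not the authors' code: it generates x(t) of the Lorenz system with a fixed-step fourth-order Runge-Kutta integrator (standard parameters 10, 28, 8/3 and step 0.01, chosen here for convenience), bins the pairs (V(t), V(t + T τ_s)), forms the MI matrix in bits, and histograms the MI values weighted by the joint probability. All function names and bin counts are hypothetical.

```python
import math

def lorenz_x(n, dt=0.01, discard=1000):
    """x(t) of the Lorenz system from a fixed-step RK4 integration."""
    def f(x, y, z):
        return 10.0 * (y - x), x * (28.0 - z) - y, x * y - (8.0 / 3.0) * z

    state = (1.0, 1.0, 20.0)
    out = []
    for step in range(n + discard):
        k1 = f(*state)
        k2 = f(*(s + 0.5 * dt * k for s, k in zip(state, k1)))
        k3 = f(*(s + 0.5 * dt * k for s, k in zip(state, k2)))
        k4 = f(*(s + dt * k for s, k in zip(state, k3)))
        state = tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                      for s, a, b, c, d in zip(state, k1, k2, k3, k4))
        if step >= discard:  # drop the initial transient
            out.append(state[0])
    return out

def mi_distribution(v, T, n_bins=16, n_mi_bins=30):
    """Return (x_min, dx, hist, ami) for the pairs (v[k], v[k + T]).

    hist[m] is the probability that MI falls in [x_min + m*dx, x_min + (m+1)*dx).
    """
    lo, hi = min(v), max(v)
    width = (hi - lo) / n_bins

    def cell(x):
        return min(int((x - lo) / width), n_bins - 1)

    n = len(v) - T
    joint, p_s, p_r = {}, [0.0] * n_bins, [0.0] * n_bins
    for k in range(n):
        i, j = cell(v[k]), cell(v[k + T])
        joint[(i, j)] = joint.get((i, j), 0.0) + 1.0 / n
        p_s[i] += 1.0 / n
        p_r[j] += 1.0 / n
    mi = {ij: math.log2(p / (p_s[ij[0]] * p_r[ij[1]]))
          for ij, p in joint.items()}
    x_min, x_max = min(mi.values()), max(mi.values())
    dx = (x_max - x_min) / n_mi_bins or 1.0
    hist = [0.0] * n_mi_bins
    for ij, x in mi.items():
        m = min(int((x - x_min) / dx), n_mi_bins - 1)
        hist[m] += joint[ij]
    ami = sum(joint[ij] * x for ij, x in mi.items())
    return x_min, dx, hist, ami

series = lorenz_x(4000)
x_min_1, dx_1, hist_1, ami_1 = mi_distribution(series, T=1)
x_min_50, dx_50, hist_50, ami_50 = mi_distribution(series, T=50)
```

Plotting `hist_1` and `hist_50` against their bin centers gives a rough picture of how the distribution changes with T; the short record used here keeps the run time small and would need to be lengthened for quantitative work.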

For the Lorenz system data set we used 2x10^ values 
of the variable x(t). The first minimum of AMI for these data is at
T = 10 in units defined by the numerical integration step. We evaluated
$P_I(x)$ for various T. In Figure 1 we show this distribution for T = 1,
for $9 \leq T \leq 11$, and for T = 16. We can see that for T = 1
$P_I(x)$ has a single well defined peak, indicating little variation of
MI across the attractor. The AMI for this distribution is larger than
that for $T \approx 10$. This is the usual indication that T = 1 is not
a preferred choice. For $9 \leq T \leq 11$ $P_I(x)$ is nearly the same
with two well formed, but nearby, peaks. These we attribute to fast and
slow motions on the attractor associated with motion about the unstable
fixed points and motion through the "neck" near the origin in phase
space. As the peaks are rather close, the use of a single T to provide
coordinates for the data vector y(t) over the whole attractor would
appear to be a good construct. If one wished to resolve the two time
scales exposed by the peaks in $P_I(x)$, the use of two different time
delay coordinate systems in different parts of the attractor would be
appropriate.

FIG. 1: Distribution of MI for the Lorenz system for various T used in
the data vector y(t). The left panel has T = 1, the middle panel
$9 \leq T \leq 11$, and the right panel has T = 16. The AMI for T = 1 is
larger than for $T \approx 10$. For T = 1 and for $9 \leq T \leq 11$ we
have rather sharp distributions $P_I(x)$, so the use of a single value
of T across the attractor is useful. For $T \approx 10$ the bimodal
nature of $P_I(x)$ suggests that one could identify two processes
contributing to $P_I(x)$ and seek a value for T for each of them. For
T = 16 the distribution is so broad that no particular meaning can be
associated with the AMI. This $P_I(x)$ indicates many different
processes contribute over the attractor, and thus, this is not a good
value of T to use in forming the data vector.

For larger T, here T = 16, we observe substantial broadening in
$P_I(x)$. Here the mutual information between V(t) and
$V(t + T\tau_s)$ is more or less uniformly distributed over the
attractor, suggesting all points are rather decorrelated, in an
information theoretic sense, from all others. This means that all
dynamical processes on the attractor convey little information from time
t to time $t + T\tau_s$. Equivalently, coordinates of the time delay
data vector y(t) formed with T = 16 are so independent of each other as
to constitute a bad choice for following the dynamics of the system.

For the nonlinear circuit data we have 64,000 data points taken at
$\tau_s$ = 0.1 ms. The minimum of AMI is at T = 6. For these data
$P_I(x)$ is shown in Figure 2 for T = 1, for $5 \leq T \leq 7$, and for
T = 11. We have a relatively sharp peak for T = 1, but high values of
AMI. The peak narrows for $T \approx 6$, and then broadens again as T
grows. We show T = 11 where the broadening of $P_I(x)$ seen already in
the Lorenz system occurs again. This indicates that this value of T is
not useful in reconstructing the entire attractor.




FIG. 2: Distribution of MI for the hysteretic circuit for various T used
in the data vector y(t). The left panel has T = 1, the middle panel
$5 \leq T \leq 7$, and the right panel has T = 11. The AMI for T = 1 is
larger than for $T \approx 6$. Each of these is a rather sharp
distribution $P_I(x)$, so the use of a single value of T across the
attractor is useful. For T = 11 $P_I(x)$ has broadened, indicating this
value of T yields components of the data vector y(t) which may be too
independent for use.

$P_I(x)$ for various models and for a variety of experimental data will
be reported in our larger paper. In this short note we have introduced
the distribution of MI and exhibited several examples, analytic and
numerical, to illustrate its properties. The use of $P_I(x)$ as
illustrated here for understanding the distribution of MI in the
connection between elements of the data vector y(t) gives us a clearer
understanding of the choice made some years ago by Swinney and Fraser of
using the first minimum of AMI to select T. It is not only low AMI among
components of y(t) that is important, but also one must have a
distribution of MI which is sharp so that a single choice for T over the
whole attractor can accurately capture the underlying dynamical
processes.



We expect $P_I(x)$ to have direct utility in the study of networks with
active dynamical elements where the function of the network is to convey
information from a set of source symbols s(k) to a set of response
symbols r(l) with high fidelity. This fidelity is characterized by high
values of MI between input and output, and maxima of $P_I(x)$, rather
than the AMI, will indicate which processes $s(k) \to r(l)$ are best
communicated. These ideas will be fully explored in our larger paper.